Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bail on broken pipe when reading udp streams #509

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

arlyon
Copy link

@arlyon arlyon commented Sep 18, 2024

I have an issue with netavark + aardvark-dns + podman where under certain conditions (in my case a container running tailscale as an exit node) aardvark-dns will end up stuck burning 100% cpu (and taking down container dns resolution in the process).

When it happens, aardvark-dns continually prints log lines along the lines of Error parsing dns message: broken pipe. I can't see anywhere where we bail as a result so this PR attempts to resolve that. I am hoping that if we break the loop and the socket is closed then it will just reconnect and continue to work as usual.

I am still figuring out how to get this into my version of coreOS (since it is quite deeply integrated) so that I may try to repro but I wanted to open this PR in case there was any insight here (or whether this line of reasoning is sound).

Additionally, some advice on writing a test for this would be very helpful. I am trying to set up a basic rust test but it is very involved.

Copy link
Contributor

openshift-ci bot commented Sep 18, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: arlyon
Once this PR has been reviewed and has the lgtm label, please assign mheon for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

Tests failed. @containers/packit-build please check.

@@ -87,6 +87,7 @@ impl CoreDns {
},
v = receiver.next() => {
let msg_received = match v {
Some(Err(err)) if err.kind() == std::io::ErrorKind::BrokenPipe => {break},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right as you break the loop for tcp and udp and there is no logic to "retry" (open the socket again). At the very least it would need to be removed from the thread_handles map so a new container spawn would allow us to add it again.

And we likely should check what kind of errors we can get here and which of those or fatal and which are recoverable in general when reading from the socket. EPIPE just doesn't make any sense to me, it is not documented in recvmsg which should be the underlying syscall (or does the tokio abstraction do anything weird)

Can you attach strace to it and provide the output of when this happens?

Copy link
Author

@arlyon arlyon Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will attempt to trigger it again and see what I can do. I have attached a test as well that demonstrates the hang when a broken pipe occurs using just a simple facade.

I agree that simply breaking the loop is not ideal but it at least stops the server from completely hanging.

(also thanks for the quick reply!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without understanding why this happens it doesn't make much sense to just merge "random" fixes and hope the best. I don't see what generates EPIPE here so I like to see a strace at least to understand which syscall is returning that. I agree there might be fatal errors that should cause us to abort but this is not specific to this one I presume and not specific to udp either?

And if we do that then we must update the thread_handles correctly and I guess try to bind the socket again.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attached strace on my coreOS instance using a rootful toolbox container and have managed to trigger the issue again but rather stupidly did not trace child threads / processes. Trying again now.

@arlyon arlyon force-pushed the arlyon/break-on-broken-pipe branch 2 times, most recently from a91adc4 to 49dd0b8 Compare September 18, 2024 12:29
@openshift-merge-robot
Copy link
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I looked a bit around and there seem to be some errors hickory considers hard errors when reading from the socket but they do not include EPIPE hickory-dns/hickory-dns@d2e64d8

So I would still like to understand how you can EPIPE here.

}

// we need 2 threads or tokio::spawn will block since it never yields
#[test_log::test(tokio::test(flavor = "multi_thread", worker_threads = 2))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of test_log? It seem this can just be dropped so we do not need extra dependencies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants